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The  diminishing  differences  between  the  hardware  structure  of  shared  memory  and  mes¬ 
sage  passing  parallel  computers  mandate  a  new  evaluation  of  the  tradeoffs  these  architectures 
impose  on  the  development  and  performance  of  applications.  In  a  message  passing  computer, 
some  message  traffic  is  used  to  perform  interprocessor  updates  which  maintain  consistency 
between  the  various  processors'  data.  IV^jjpnsider  this  traffic  to  be  analogous  to  global  bus 
traffic  needed  in  a  shared  memory  computer  for  hardware  cache  consistency.  Using  Locus- 
Route.  a  global  router  for  standard  cells,  as  a  case  study,  we  investigate  the  level  of  traffic 
required  to  maintain  consistency  of  data  with  each  of  the  two  architectures.  By  explicitlv 
varying  the  frequency  of  interprocessor  updates,  the  level  of  traffic  in  the  message  passing 
approach  can  be  reduced  to  as  little  as  1%  of  the  traffic  in  the  shared  memory  approach 
while  still  obtaining  solution  quality  within  10%  of  the  quality  given  by  the  shared  memory 
version.  We  show  that  exploiting  locality,  in  the  way  wires  to  be  routed  are  assigned  to 
processors,  can  further  lower  this  message  traffic  by  as  much  as  67%.  However,  the  degree 
to  which  locality  can  be  exploited  may  be  limited  by  the  opposing  requirement  that  the 
application  be  load  balanced,  as  well  as  by  limited  locality  in  the  data  set. 
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In  recent  years,  there  has  been  much  debate  about  the  relative  merits  of  shared  memory  and 
message  passing  parallel  architectures.  The  previously  large  distinctions  between  the  two  ap¬ 
proaches.  however,  are  now  diminishing.  The  main  drawbacks  of  the  early  message  passing 
computers,  such  as  the  NCTBE/ten  [ll]  were  the  high  network  latency  and  the  large  message 
reception  overhead.  These  characteristics  forced  programmers  to  exploit  only  large  grain  par¬ 
allelism.  With  the  development  of  new  message  passing  computers,  such  as  the  Ametek  Series 
2010  and  the  Message  Driven  Processor  [3.9].  things  are  rapidly  changing.  Using  specialized 
routing  chips  and  the  technique  of  wormhole  routing  [6].  the  network  latencies  have  been  re¬ 
duced  by  2-3  orders  of  magnitude.  Similarly,  using  dedicated  hardware  to  copy  messages  to 
and  from  the  network  and  using  innovative  memory  mapping  techniques  [10.18].  the  message 
reception  overhead  has  been  cut  down  by  1-2  orders  of  magnitude.  These  reductions  enable 
current  machines  to  exploit  parallelism  at  a  fairly  fine  grain.  Furthermore,  it  is  now  possible  to 
approximate  aspects  of  the  shared-memory  model,  since  sending  a  message  to  a  remote  processor 
requesting  an  update  of  some  global  data  is  no  longer  an  unreasonably  long  operation. 
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On  the  other  hand,  we  see  that  efforts  to  scale  shared  memory  machines  led  them  to  resemble 
message  passing  computers  in  several  ways.  Traditional  shared  memory  architectures  used  a 
shared  global  bus  to  memory  [5]  and  could  only  accommodate  a  limited  number  of  processors 
before  the  global  bus  connecting  processors  and  memory  became  saturated.  Because  of  the 
limited  scalability  of  these  architectures,  designers  of  shared  memory  machines  are  now  turning 
to  architectures  using  processor  clusters  with  directory- based  cache  coherence  schemes  between 
the  clusters  [19.1].  hierarchical  architectures  with  more  than  one  level  of  shared  buses  such  as 
the  Encore  I'hraMax  [13.4].  or  architectures  with  multistage  processor  interconnetftioiLfletworks^  ^ 
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like  the  BB.N'  Butterfly  (12]  and  IBM  RP3  [16].  In  directory  based  approaches,  consistency 
operations  between  dusters  are  performed  on  a  point-to-point  basis,  with  invalidations  going 
only  to  the  clusters  that  need  them.  Also,  the  latency  of  non-local  communication  can  be  as 
much  as  an  order  of  magnitude  larger  than  that  of  a  local  access  in  all  of  these  machines.  These 
two  characteristics,  point  to  point  communication  and  high  non-local  reference  latency,  increase 
the  cost  of  non-local  traffic  in  a  shared  memory  approach,  and  force  the  programmer  of  these 
shared  memory  machines  to  consider  more  carefully  the  effect  of  locality. 

While  the  gap  between  the  two  architectures  is  narrowing,  there  are  still  fundamental  differ¬ 
ences  between  the  two  which  force  tradeoffs  between  the  two  architectures.  The  shared  memory 
architecture  considered  in  this  paper  has  a  single  global  address  space  with  hardware  to  guar¬ 
antee  the  consistency  of  data  in  the  processor  caches.  In  this  case,  cache  coherence  protocols 
enforce  consistency  of  data  [2].  In  the  message  passing  architectures,  each  processor  has  a  sep¬ 
arate  address  space,  and  exchanges  of  information  occur  by  sending  messages  on  the  network. 
One  type  of  message  passed  on  the  network  is  for  updating  distributed  data  held  by  all  the 
processors  to  periodically  make  it  consistent  with  the  other  processors'  data.  This  message  traf¬ 
fic  for  updates  in  the  message  passing  case  is  analogous  to  the  cache  coherence  traffic  used  in 
the  shared  memory  case.  In  the  message  passing  case,  this  traffic  is  explicitly  controlled  by  the 
application  programmer,  who  decides  when  and  how  to  perform  updates  between  processors.  In 
the  shared  memory  case,  the  traffic  is  implicitly  controlled  bv  the  cache  consistency  hardware, 
which  relieves  the  programmer  from  the  process  of  maintaining  data  consistency. 

In  many  cases,  however,  the  level  of  consistency  enforced  by  the  shared  memory  computer 
may  be  more  than  is  needed  by  a  particular  application.  In  such  cases  the  message  passing  com¬ 
puter  may  be  superior,  because  it  allows  the  application  programmer  to  control  the  degree  of 
consistency  explicitly.  In  this  paper,  we  explore  several  such  tradeoffs  between  shared-memory 
and  message  passing  architectures  using  the  LocusRoute  [17]  application  as  a  case  study.  (Lo- 
cusRoute  is  a  commercial  quality  routing  program  for  standard  cells  and  is  now  being  widely 
used  at  Stanford  for  parallel  processing  studies.)  Our  results  comparing  network  traffic  show 
that  the  message  passing  version  of  the  program  generates  only  1%  of  the  traffic  that  the  shared 
memory  version  does,  while  the  degradation  in  the  quality  of  the  routing  is  less  than  10VT .  An¬ 
other  issue  this  paper  addresses  is  the  sensitivity  of  the  network  traffic  to  the  exploitation  of 
locality  in  the  data  set.  In  our  study,  we  exploit  locality  by  assigning  wires  which  are  physically 
close  to  each  other  in  the  circuit,  to  the  same  processor.  Specifically,  we  show  that  message 
passing  architectures  can  reduce  their  network  traffic  by  more  than  50%  by  exploiting  locality, 
while  also  improving  solution  quality.  The  effect  on  shared  memory  architectures,  while  not  so 
large,  is  also  significant.  Some  other  issues  regarding  exploiting  of  locality  and  execution  time 
are  also  discussed. 


The  rest  of  the  paper  is  structured  as  follows.  The  next  section  gives  information  about 
the  LocusRoute  application  and  the  simulation  tools  used  to  collect  data  for  the  architectural 
comparison.  Section  3  describes  the  changes  made  to  the  original  shared  memory  LocusRoute 
to  convert  it  to  a  message  passing  style.  Section  4  presents  our  results  on  network  traffic  for 
the  two  architectures  under  varying  assumptions,  and  Section  5  shows  the  reduction  of  network 
traffic  made  possible  by  exploiting  locality.  Section  6  presents  conclusions  based  on  this  data, 
and  suggests  further  areas  to  explore. 
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Figure  1:  Standard  cell  placement  and  corresponding  cost  array. 

2  Applications,  Tools  and  Methodology 

To  understand  the  architectural  comparisons  being  made,  one  must  understand  the  application 
the  data  is  based  on.  and  the  tools  used  to  make  the  measurements.  Our  starting  point  was 
the  version  of  LocusRoute  written  for  a  shared  memory  machine.^  This  code  was  converted, 
as  described  in  Section  3.  to  a  message  passing  style.  Because  there  was  no  message  passing 
computer  available  to  run  the  code,  we  used  CBS.  a  simulator  for  parallel  message  passing 
machines.  With  cbs.  detailed  statistics  on  execution  time  and  network  behavior  are  readily 
available.  To  make  network  traffic  comparisons  between  the  message  passing  version  and  the 
shared  memory  version,  another  program,  also  described  below,  was  written  to  estimate  the 
amount  of  bus  traffic  required  by  the  shared  memory  approach. 

2.1  LocusRoute 

LocusRoute  [17]  is  an  industrial  quality  router  for  VLSI  standard  cells  developed  by  Jonathan 
Rose  at  Stanford  University.  LocusRoute  routes  the  wires  of  a  given  standard  cell  placement, 
while  attempting  to  minimize  the  overall  circuit  area.  To  do  this,  it  maintains  a  global  data 
structure  known  as  the  Cost  Array.  The  vertical  dimension  of  the  array  is  the  number  of 
routing  channels  in  the  circuit,  and  the  horizontal  dimension  of  the  array  is  the  number  of 
routing  grids.  The  Cost  Array  keeps  a  record  of  the  number  of  wires  running  through  each 
sector  of  the  circuit.  Each  wire  is  routed  along  the  path  with  the  minimal  sum  of  the  cost  array 
entries.  Figure  1  shows  a  standard  cell  circuit  and  one  of  its  wires,  with  the  corresponding  Cost 
Array.  The  highlighted  portions  of  the  cost  array  will  be  incremented  if  this  route  is  chosen. 

In  addition  to  producing  the  routed  circuit.  LocusRoute  also  computes  a  measure  of  the 
solution's  quality.  Quality,  also  referred  to  in  this  paper  as  the  circuit  height .  is  computed  as 
follows.  For  each  channel,  the  number  of  wires  using  the  channel  will  vary  across  the  width 
of  the  circuit.  The  number  of  routing  tracks  required  by  the  channel  is  the  maximum  number 
of  wires  running  through  the  channel  at  any  point.  The  circuit  height  is  the  total  number  of 
routing  tracks  required  for  all  channels. 

The  Cost  Array  is  the  central  data  structure  for  the  LocusRoute  application,  and  it  accounts 
for  almost  all  of  the  shared  data  references  made  by  LocusRoute.  Therefore,  studying  the 
reference  patterns  to  the  cost  array  will  provide  an  excellent  approximation  to  the  application's 
memory  reference  behavior  as  a  whole.  Examining  the  references  to  the  cost  array  for  one  wire, 
we  see  that  LocusRoute  starts  with  a  series  of  reads  to  explore  possible  routes  for  the  wire. 
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LocusRoute  reads  every  location  of  the  cost  array  along  the  paths  being  considered.  This  is 
followed  by  a  smaller  stream  of  writes,  as  the  cost  array  is  updated  along  the  final  path  of 
the  wire.  This  basic  sequence  of  reads  and  writes  occurs  for  each  wire,  with  several  processors 
routing  wires  in  parallel.  Performing  several  iterations  of  rooting  improves  the  final  solution 
quality,  but.  before  rerouting  a  wire  for  an  iteration  after  the  first  one.  the  processor  must  "rip 
up"  the  old  routing  of  the  wire  by  decrementing  the  cost  array  locations  in  its  path.  These  rip 
up  operations  are  the  second  type  of  writes  performed  on  the  cost  array. 

Consistency  of  the  cost  array  is  an  important  issue  in  this  paper,  and  one  with  serious  im¬ 
plications  on  the  amount  of  traffic,  so  it  is  important  to  understand  how,  and  to  what  extent, 
consistency  is  maintained  in  the  shared  memory  version  ofLocusRoute:  To  avoid  the  perfor¬ 
mance  bottleneck  a  lock  would  impose,  accesses  to  the  cost  array  are  not  locked.  This  implies 
that  simultaneous  operations  on  the  same  element  of  the  cost  array  may  result  in  one  of  the 
operations  being  lost.  As  previously  stated.  LocusRoute  is  a  optimization  problem,  and  can 
tolerate  a  certain  amount  of  inconsistency.  With  the  number  of  processors  the  shared  memory 
version  currently  uses  (up  to  16).  the  probability  of  simultaneous  writes  is  very  low.  and  exper¬ 
iments  indicate  that  the  quality  is  not  degraded.  Except  for  this,  consistency  in  the  cost  array 
is  maintained  at  the  hardware  level,  by  the  cache  coherency  hardware. 

When  running  the  experiments,  two  benchmark  circuits  were  used.  The  first  circuit.  bnrE 
has  420  wires,  a  size  of  10  channels  by  341  routing  grids,  and  represents  an  actual  standard 
cell  circuit  developed  at  Bell-Northern  Research  Ltd.  The  second  circuit.  MDC.  has  573  wires 
with  a  size  of  12  channels  by  386  routing  grids,  and  was  designed  at  the  University  of  Toronto 
Microelectronic  Development  Centre. 

2.2  CBS:  A  Message  Passing  Architecture  Simulator 

Execution  of  LocusRoute  on  a  message  passing  computer  was  simulated  using  a  program  called 
CBS.  CBS  [15]  is  a  C++  program  written  by  Andreas  Nowatzyk  at  Carnegie  Mellon  University 
which  simulates  the  behavior  of  a  k-arv  n-dimensional  hypercube  machine  (with  a  total  of  kn 
processors).  For  the  experiments  described  here.  CBS  simulated  a  machine  with  deterministic 
wormhole  routing  using  the  E-cube  routing  algorithm  [14.21].  and  with  the  dimension  n.  always 
equal  to  two  (mesh  interconnection).  The  use  of  wormhole  routing  minimizes  the  effect  of 
the  distance  between  destination  and  source,  making  the  assignment  of  processes  to  processing 
elements  less  critical.  Research  by  Dally  [7.8]  indicates  that  low-dimensional  networks  have 
greater  channel  bandwidth,  and  better  hot-spot  throughput  than  do  high-dimensional  networks. 
These  two  features  give  the  simulated  machine  low-latency,  high-bandwidth  communication 
performance  which  makes  it  competitive  with  the  shared  memory  machine. 

CBS  uses  a  detailed  simulation  model  to  produce  its  network  statistics.  CBS  simulates  the 
behavior  of  the  processor  interconnection  network  at  the  level  of  individual  flow  control  units  (in 
this  case,  single  bytes)  flowing  between  processors.  There  are  unidirectional  channels  connecting 
a  processor  to  its  North  and  East  neighbors.  This  means  that  a  packet  must  travel  all  the  way 
around  the  network  to  talk  to  its  West  neighbor.  The  network  performance  is  specified  with 
two  parameters:  t-delay  and  cudtlay.  T .delay  is  the  time  required  for  1  byte  to  travel  one  hop 
on  the  network,  and  c.deiay,  the  time  required  for  the  entire  packet  to  be  copied  dotvn  from 
the  processor  node  to  the  message  network,  or  up  from  the  message  network  to  the  destination 
processor  node.  Assuming  no  delays  due  to  contention,  the  total  time  required  for  a  packet  of 
length  L  to  travel  D  hops  on  the  network  is  Icjielay  +  tJelay{D  +  L).  To  simulate  the  execution 
time  of  the  node  processors,  a  delay  statement  is  provided  which  blocks  the  running  processor 
for  the  number  of  time  units  specified.  Timings  obtained  from  the  Encore  microsecond  clock 
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were  used  as  arguments  for  tbe  delay  statement. 

For  the  purpose  of  concreteness,  we  chose  to  set  the  performance  parameters  to  model  the 
behavior  of  the  Ametek  Series  2010  Message  Passing  Multicomputer  [18.10].  A  packet  of  length 
L  travelling  D  hops  on  the  Ametek  requires  CopyTime  +  HopTime(2D  +  L).  Copy  Time  is  the 
time  required  to  copy  the  message  from  the  network  to  the  node  processor's  address  space,  which 
depends  on  the  message  length.  This  can  be  performed  at  about  50MB/s.  Assuming  an  average 
message  length  of  200  bytes,  we  chose  CopyTime  to  equal  4000  ns.  so  c .delay  was  set  to  half 
of  that,  or  2000  ns.  1  HopTime  for  the  Ametek  is  defined  as  the  time  it  takes  for  one  byte 
of  a  packet  to  advance  one  hop.  assuming  that  the  route  has  already  been  established.  This 
is  stated  to  be  50  ns.  (Establishing  the  route  is  slower,  so  the  head  of  a  packet  requires  two 
HopTimes  to  advance  one  hop.)  To  make  the  Ametek  packet  latency  equation  conform  to  the 
CBS  form,  we  factored  out  the  2  to  get:  CopyTime  +  2 HopTime(D  +  L/2).  Now  we  can  set 
t -delay  equal  to  2HopTimt  or  100  ns.  When  the  simulation  is  run.  the  number  of  bytes  in  a 
packet  is  always  cut  in  half,  so  that  the  L/2  term  in  the  Ametek  equation  matches  the-L  term 
in  the  CBS  equation.  Also,  since  the  Ametek  2010's  processing  elements  are  about  five  times 
faster  than  the  Multimax's  processing  elements,  we  divide  the  times  obtained  on  the  Encore 
processing  elements  by  a  factor  of  five  before  using  them  as  arguments  to  the  delay  statement. 

2.3  Shared  Memory  Traffic  Evaluation 

While  for  the  message  passing  implementation,  the  network  traffic  is  directly  given  by  CBS.  there 
is  no  simple  way  of  estimating  traffic  for  the  shared  memory  implementation.  In  order  to  make 
comparisons  of  the  traffic  required  for  message  passing  and  shared  memory  approaches,  we  need 
a  method  for  measuring  the  traffic  generated  by  LocusRoute  on  a  shared  memory  machine. 
Although  we  have  the  capability  to  directly  trace  all  memory  references  [20.22].  these  direct 
methods  require  a  large  amount  of  memory,  which  limits  the  portion  of  the  program  that  can  be 
traced  to  about  1  wire  per  processor.  As  will  be  described  in  Section  3.  updates  in  the  message 
passing  approach  can  occur  at  time  intervals  greater  than  the  total  time  being  monitored  by  these 
detailed  trace  methods,  making  traffic  comparison  difficult.  Instead  we  have  chosen  to  modify 
the  shared  memory  version  of  LocusRoute  to  record  information  about  the  memory  reference 
stream  over  the  whole  execution  time  and  use  this  information  to  estimate  total  traffic. 

Before  explaining  the  traffic  estimating  program,  we  will  first  explain  in  detail  the  type  of 
machine  to  be  simulated.  This  work  considers  a  shared  memory  multiprocessor  with  a  single 
global  bus  that  all  processors  use  to  access  memory.  Each  processor  has  a  private  cache  memory, 
and  consistency  is  maintained  by  dedicated  cache  coherence  hardware  using  a  Write  Back  with 
Invalidate  scheme.  The  traffic  being  measured  in  the  shared  memory  version  is  the  traffic  on 
the  shared  global  bus.  2 

With  the  above  machine  model  in  mind,  we  now  describe  a  method  for  estimating  the 
bus  traffic.  The  uniprocessor  version  of  LocusRoute  is  modified  to  record  memory  reference 
information.  Data  is  printed  to  a  trace  file  whenever  one  of  four  types  of  events  occurs.  The 
four  event  types  are  described  below: 

1.  Wire  event:  Whenever  routing  of  a  new  wire  is  begun,  the  time  and  wire  name  are 
recorded  as  a  wire  event  in  the  trace  file. 

‘C-delay  if  a  parameter  aet  at  the  time  the  simulator  is  compiled.  Therefore,  one  cannot  compute  c .delay 
dynamically. 

JOur  traffic  estimating  program  can  simulate  non- bus-oriented  architectures  as  well,  but  since  we  are  only 
considering  16  processing  elements,  we  present  data  only  for  tbe  bus-based  scheme. 


2.  Iteration  event:  When  a  new  iteration  is  began,  the  time  at  which  it  was  begun  is 
recorded  as  an  iteration  event  in  the  trace  file.  These  first  two  events  are  needed  by  the 
traffic  evaluator  for  interleaving  the  execution  among  several  processors. 

3.  Read  event:  LocusRoute  is  structured  so  that  a  processor  reads  data  from  the  cost  array 
when  it  evaluates  possible  routes  for  a  wire  segment.  For  each  route  considered,  a  processor 
reads  all  the  locations  of  the  cost  array  along  the  path  of  the  route.  At  the  beginning  of 
each  of  these  read  sequences,  a  read  event  is  recorded  in  the  trace  file,  with  the  time,  and 
the  locations  affected  by  the  reads. 

4.  Write  event:  A  processor  writes  data  at  two  times:  (i )  at  the  end  of  exploring  alternative 
routes,  and  (ii)  when  it  does  a  rip-up  before  starting  to  explore  routes.  Both  of  these  are 
recorded  as  write  events  in  the  trace  file  along  with  the  time  at  which  they  are  begun,  and 
the  locations  affected  by  the  writes. 

All  events  are  assumed  to  be  atomic:  one  time  is  recorded  in  the  trace  file  at  the  beginning 
of  the  read  or  write  sequence,  and  all  reads  or  writes  associated  with  that  event  are  assumed  to 
occur  at  that  time. 

The  trace  file  described  above  records  the  relevant  memory  access  information  about  a 
uniprocessor  run  of  LocusRoute.  The  events  recorded  in  the  trace  file  give  enough  informa¬ 
tion  to  interleave  execution  among  more  processors,  so  that  multiprocessor  traffic  data  can  be 
estimated  using  the  steps  described  here.  First,  the  simulator  decides  on  an  assignment  of  wires 
to  processors,  using  one  of  the  heuristics  from  Section  3.  At  this  point,  each  event  in  the  trace 
is  associated  with  a  certain  wire,  and  each  wire  has  now  been  assigned  to  a  processor,  so  all  the 
events  are  now  associated  with  the  processor  executing  them.  Events  for  each  processor  can  be 
interleaved  using  the  times  recorded  in  the  trace  file.  To  simulate  the  traffic,  events  are  pulled  off 
the  interleaved  event  queue  and  handled  in  order.  The  cache  coherence  protocol  implemented 
is  Write  Back  with  Invalidate  Scheme  [2).  The  first  write  to  any  cache  line  results  in  a  bus 
operation  which  causes  all  other  caches  to  invalidate  that  line  if  it  is  present.  Subsequent  reads 
or  writes  by  that  processor  do  not  result  in  any  bus  traffic.  A  read  or  write  by  another  processor 
to  that  cache  line  causes  that  processor  to  become  the  owner,  and  forces  the  previous  owner  to 
invalidate  the  line  from  its  cache. 

Like  any  such  measurement  system,  ours  has  its  inaccuracies,  which  we  will  list  here.  First, 
the  system  simulates  infinite  caches.  Lines  are  only  written  back  to  memory  for  coherency 
reasons,  never  simply  for  replacement.  Because  the  data  structure  we  are  studying  is  smaller 
than  most  multiprocessor  caches  (about  8000  bytes),  this  is  not  a  major  flaw.  Second,  we  assume 
that  read  and  write  events,  which  are  conglomerations  of  several  read  or  writes,  occur  atomically. 
In  actual  execution,  these  operations  occur  as  sequences  of  operations  in  tight  (single  instruction ) 
loops.  The  atomic  assumption  only  leads  to  inaccuracy  when  a  read  event  and  a  write  event,  or 
two  write  events,  that  should  be  occurring  simultaneously  become  serialized  by  the  simulator.  If 
these  events  were  occurring  in  the  same  area  of  the  cost  array,  then  their  simultaneous  execution 
would  lead  to  multiple  invalidations  and  refetches.  If  they  are  serialized,  at  most  one  invalidation 
and  refetch  will  occur.  This  will  cause  the  simulator  to  slightly  underestimate  the  total  traffic. 
This  is  not  a  major  effect,  because  write  operations  are  relatively  infrequent,  so  simultaneous 
reads  and  writes  to  the  same  cache  line  are  highly  improbable.  Both  of  the  inaccuracies  tend  to 
slightly  underestimate  the  total  traffic  so  that  the  numbers  given  in  Section  5  may  be  considered 
lower  bounds  on  the  actual  traffic. 
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Figure  2:  Division  of  the  cost  array  among  processors. 

3  Implementing  LocusRoute  for  a  Message  Passing  Machine 

Finally,  before  we  go  on  to  discuss  the  results,  we  need  to  specify  how  we  mapped  LocusRoute  to 
a  message  passing  architecture.  Because  the  message  passing  machine  has  distributed  memory, 
implementing  LocusRoute  required  changes  in  the  distribution  and  updating  of  information 
between  processors.  To  reduce  message  traffic  on  the  network,  a  static  method  of  assigning 
wires  to  processors  was  used,  rather  than  allowing  processors  to  send  requests  out  when  they 
need  a  wire.  These  changes  are  described  below. 

3.1  Distribution  of  Data  Structures 

The  most  important  data  structure  in  the  program,  as  previously  stated,  is  the  Cost  Array.  In 
the  shared  memory  version,  all  processors  have  access  to  a  single  copy  of  the  cost  array.  The 
message  passing  architecture  forces  the  programmer  to  decide  how  the  cost  array  should  be 
handled  in  a  machine  with  distributed  memory.  We  chose  to  divide  the  cost  array  into  sections, 
with  each  processor  being  the  owner  of  one  section.  Each  processor  is.  however,  allowed  to 
have  a  view  of  the  whole  cost  array.  The  portion  that  is  owned  by  that  processor  will  be  as 
consistent  as  possible,  while  the  other  portions  of  the  cost  array  are  less  consistent,  but  still 
usable.  Consider,  for  example,  a  four  processor  case.  Figure  2  shows  each  processor's  cost  array, 
with  the  portion  that  it  owns  highlighted.  Although,  the  unowned  portions  of  each  processor's 
cost  array  may  not  be  accurate,  the  processor  is  still  allowed  to  make  use  of  them.  Thus,  if  there 
are  4  processors,  the  bottom  left  processing  element  will  own  the  bottom  left  fourth  of  the  cost 
array,  but  it  will  also  have  a  copy  of  the  rest  of  the  cost  array  which  it  can  use.  In  all  future 
discussion,  the  processor  which  owns  a  certain  region  of  the  cost  array  will  be  called  the  ou-ner 
processor  for  that  region,  and  the  region  itself  will  be  called  the  owned  region. 

3.2  Maintaining  Cost  Array  Consistency 

No  circuit  is  perfectly  local,  that  is.  wires  assigned  to  one  processing  element  will  extend  into 
regions  owned  by  other  processing  elements.  Consequently,  using  the  scheme  discussed  above 
cost  array  updates  between  processors  are  needed.  Many  different  methods  of  performing  these 
updates  are  possible:  one  can  experiment  with  the  frequency  of  updates,  as  well  as  how  the 
updates  are  initiated.  The  update  frequency  was  allowed  to  vary,  with  updates  occurring  at 
time  intervals  on  the  order  of  the  time  it  takes  to  route  one  wire.  The  decision  of  which  update 
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strategy'  to  use  depends  heavily  on  the  underlying  computer  architecture.  We  have  considered 
two  main  types  of  updates,  and  variations  on  these.  The  two  types  of  updates  considered  are 
tender  initiated  updates,  and  receiver  initiated  updates,  as  well  as  a  mixture  of  these  two. 

With  sender  initiated  updates,  the  processor  which  determines  that  an  update  is  necessary 
is  the  one  to  send  out  the  data.  With  receiver  initiated  updates,  the  processor  which  determines 
that  an  update  is  necessary  sends  a  request  packet  to  another  processor,  and  the  destination 
processor  then  sends  back  the  requested  update  data.  The  architectural  dependencies  in  these 
schemes  should  be  dear.  For  receiver  initiated  updates  to  be  useful,  the  latency  of  the  network 
as  well  as  the  message  reception  time,  must  be  low,  so  that  the  requesting  processor  spends  a 
minimal  amount  of  time  idle,  waiting  for  the  requested  data.  On  the  other  hand,  our  results 
shown  in  Section  4  indicate  that  sender  initiated  updates  tend  to  send  oat  more  bytes  than 
receiver  initiated,  and  therefore  place  a  greater  premium  on  high  network  bandwidth. 

3.3  Wire  Assignment 

One  advantage  of  the  shared  memory  architecture  is  that  the  wires  to  be  routed  can  be  eas¬ 
ily  allocated  to  processors  dynamically,  using  a  distributed  loop.  3  In  the  message  passing 
architecture,  dynamic  wire  allocation  requires  message  transactions  on  the  network.  In  our  im¬ 
plementation.  processors  only  retrieve  messages  queued  for  them  at  the  end  of  each  wire  routed. 
Assuming  that  the  processor  receiving  the  task  requests  is  also  routing  wires,  the  potential  wait 
for  a  wire  task  is  large,  because  in  this  case,  a  processor  may  have  to  wait  for  an  entire  wire  to 
be  routed  before  the  wire  assignment  processor  even  receives  the  message.  With  this  in  mind, 
a  static  method  of  wire  assignment  was  used.  Because  of  the  division  of  cost  array  into  owned 
regions,  the  algorithm  benefits  from  a  method  of  wire  assignment  that  attempts  to  assign  wires 
to  the  processor  that  owns  the  region  they  run  through.  We  use  a  very  simple  heuristic  devel¬ 
oped  to  achieve  this  goal.  We  assign  wires  to  the  owner  processor  of  the  lower  leftmost  pin  of 
the  wire. 

To  control  the  amount  of  locality  exploited,  a  parameter  ThresholdCost  is  provided.  If  a 
wire  s  "cost",  a  function  of  its  length,  is  less  than  the  threshold,  it  will  be  assigned  using  the 
heuristic  described  above.  Otherwise,  it  is  held  in  a  pool  of  unassigned  wires,  and  is  assigned 
to  a  processor  at  the  end  of  the  wire  assignment  phase.  The  processor  it  is  assigned  to  is  the 
processor  whose  total  wire  cost  is  the  current  minimum.  With  this  method,  a  high  value  of 
ThresholdCost  results  in  a  wire  assignment  that  is  based  primarily  on  locality,  while  a  low  value 
of  ThresholdCost  results  in  a  wire  assignment  that  is  based  mostly  on  load  balancing. 


4  Traffic  in  Shared  Memory  and  Message  Passing  Architec¬ 
tures 

In  this  section,  we  present  results  on  the  traffic  generated  by  shared  memory  and  message  passing 
architectures.  We  think  this  is  a  useful  measure  because  in  both  architectures  overly  high  network 
traffic  results  in  a  performance  penalty.  For  example,  since  the  message  passing  architecture 
forces  the  cost  array  to  be  distributed  across  the  processors,  periodic  update  messages  are  needed 
to  keep  the  processors'  views  of  the  cost  array  consistent.  There  is  computational  overhead 
associated  with  sending  and  receiving  these  messages,  so  one  would  like  to  update  as  infrequently 

3  Associated  with  a  distributed  loop  is  a  locked  index  variable.  To  get  the  next  wire  to  be  routed,  a  processor 
obtains  the  lock,  reads  and  increments  the  index  of  the  next  wire  to  be  routed,  and  releases  the  lock. 
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u  possible.  Similarly,  in  the  shared  memory  architecture,  hardware  cache  consistency  protocols 
cause  extra  global  bus  traffic  due  to  cache  line  invalidations  and  any  subsequent  refetches  that 
may  be  needed.  These  operations  cause  the  processor  to  stall,  and  also  represent  a  performance 
overhead.  Although  the  degree  to  which  network  traffic  translates  to  performance  overhead 
will  be  different  for  the  two  architectures,  we  contend  that  a  comparison  of  the  network  traffic 
between  the  two  architectures  is  itself  useful. 

This  section  will  show  that  traffic  for  the  shared  memory  architecture  is  a  strong  function 
of  the  cache  line  size,  while  traffic  in  the  message  passing  architecture  is  explicitly  controlled  by 
the  programmer.  This  explicit  control  allows  the  traffic  to  be  reduced  by  more  than  two  orders 
of  magnitude  compared  to  the  shared  memory  traffic,  and  the  program  still  gives  comparable 
solution  quality. 

4.1  Traffic  in  the  Shared  Memory  Architecture 

Here  we  consider  traffic  in  the  shared  memory  approach.  Traffic  in  the  shared  memory  approach 
is  made  up  of  3  parts.  First,  the  processor's  very  first  access  to  a  location  always  results  in  a 
miss,  and  brings  the  line  into  the  cache.  Second,  the  first  write  to  a  clean  location  causes  a 
word  write  on  the  shared  bus.  The  other  processors  see  this  write  and  invalidate  that  cache  line 
if  it  is  in  their  cache.  Third,  once  a  line  has  been  invalidated  by  a  cache,  it  may  need  the  line 
again.  This  leads  to  refetches  of  data  from  memory.  Traffic  in  the  shared  memory  architecture 
is.  clearly,  a  function  of  the  cache  coherence  protocol  used,  and  the  line  size  of  the  cache.  For 
all  the  results  given  here,  the  coherence  protocol  used  was  a  Write  Back  with  Invalidate  scheme 
[2].  The  line  size  of  the  cache  was  allowed  to  vary. 

Increases  in  the  cache  line  size  can  have  the  effect  of  either  increasing  or  decreasing  traffic. 
There  are  two  factors  which  will  increase  traffic  with  increasing  line  size.  First,  with  an  increased 
line  size,  data  items  that  will  never  be  used  are  more  likely  to  be  brought  into  the  cache.  This 
will  increase  the  traffic  on  the  bus.  Second,  increasing  the  line  size  means  there  will  be  more 
data  in  the  cache  (under  the  infinite  cache  assumption)  and  this  means  that  processors  are  more 
likely  to  interfere  with  each  other.  With  more  data  in  the  caches,  processors  are  more  likely  to 
force  invalidations  in  other  caches.  These  invalidations,  as  well  as  the  subsequent  refetches,  also 
cause  the  traffic  to  increase.  On  the  other  hand,  it  is  possible  for  a  longer  cache  line  to  cause 
a  traffic  decrease  as  well.  If  there  are  several  shared  data  items  stored  relatively  dose  to  each 
other,  then  a  single  invalidation  of  a  long  cache  line  could  cause  them  to  all  be  invalidated  in  one 
operation.  This  can  save  bytes  over  the  case  of  several  individual  invalidations,  and  cause  the 
traffic  to  decrease.  Because  this  last  situation  happens  infrequently,  its  effect  is  minor  compared 
to  the  first  two.  Thus,  we  expect  that  increasing  the  cache  line  size  will  lead  to  an  increase  in 
the  number  of  bytes  transferred. 

Table  1  shows  the  shared  memory  bus  traffic  as  a  function  of  the  cache  line  size.  As  predicted, 
the  data  clearly  shows  that  the  traffic  increases  significantly  as  the  line  size  increases.  For 
example,  in  the  MDC  circuit,  a  cache  line  size  of  4  bytes  causes  the  total  traffic  to  be  932.976 
bytes  while  a  larger  cache  line  size  of  32  bytes  causes  the  traffic  to  increase  sharply  to  5,840.280. 
more  than  five  times  as  much. 

Solution  quality  and  execution  time  are  not  available  for  the  wire  assignments  shown  in 
Table  1  because  the  actual  shared  memory  implementation  uses  a  dynamic  wire  assignment. 
The  simulator  does  not  actually  route  wires,  so  it  cannot  output  solution  quality  or  execution 
time  values.  However,  for  comparison  with  the  message  passing  figures  presented  later  in  the 
paper,  the  shared  memory  version  of  LocusRoute  running  on  an  Encore  Multimax  with  16 
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Tibi*  1:  Traffic  as  a  function  of  cache  line  size  in  shared  memory  version. 
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processors  can  route  the  bnrE  circuit  in  a  time  of  5.59  seconds  with  a  height  of  136.  For  MDC. 
the  solution  quality  is  144  in  a  time  of  5.78  seconds. 

4.2  Traffic  in  the  Message  Passing  Architecture 

Now  that  we  have  presented  data  from  the  shared  memory  architecture,  we  move  on  to  the 
data  from  the  message  passing  architecture,  comparing  the  two  as  we  go.  Traffic  in  the  message 
passing  approach  is  determined  bv  the  programmer,  subject  to  the  constraint  of  acceptable 
solution  quality.  The  programmer  controls  the  size  of  messages,  as  well  as  their  frequency.  The 
results  given  in  Table  2  show  the  traffic  required  by  the  message  passing  approach  with  varying 
update  strategies.  All  the  results  given  are  for  16  processors.  Because  of  the  explicit  tradeoffs 
in  the  message  passing  approach  between  network  traffic,  solution  quality,  and  execution  time, 
one  cannot  discuss  the  amount  of  network  traffic  required  without  also  discussing  the  update 
strategies  used  and  the  resulting  solution  quality  and  execution  time.  Table  2  shows  data  for 
three  different  update  strategies  (discussed  in  Section  3.2).  Recall  that,  in  a  sender  initiated 
strategy,  it  is  the  sender  of  the  update  that  determines  when  the  update  should  be  sent.  In  the 
receiver  initiated  strategy,  the  processor  wishing  to  receive  an  update  sends  a  request  to  another 
processor,  who  then  returns  the  data.  The  third  strategy  is  a  mixture  of  the  other  two.  In  this 
third  mixed  approach,  sender  initiated  updates  occur  with  the  same  frequency  as  in  the  purely 
sender  initiated  strategy,  and  receiver  initiated  updates  occur  with  the  same  frequency  as  in  the 
purely  receiver  initiated  approach. 


4.2.1  Network  Traffic 

There  are  two  points  to  be  made  about  the  message  passing  network  traffic.  First,  there  is  a 
large  difference  between  bytes  transferred  for  sender  initiated  and  receiver  initiated.  Intuitively, 
one  would  expect  the  receiver  initiated  approach  to  be  more  efficient  in  terms  of  network  traffic, 
because  data  is  only  sent  when  it  is  specifically  requested.  In  contrast,  in  the  sender  initiated 
approach,  data  is  sent  periodically,  regardless  of  whether  the  destination  processor  is  routing 
wires  in  that  area  and  needs  that  data  or  not.  Consequently,  one  would  expect  a  larger  number 
of  bytes  to  be  necessary  to  get  the  same  quality  as  receiver  initiated.  The  data,  shown  in  Table 
2  bears  out  this  intuition.  Receiver  initiated  transfers  use  anywhere  from  44%  to  71%  few’er 
bytes  than  sender  initiated  to  produce  similar  quality  results.  Because  sending  and  receiving 
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Table  2:  Traffic  in  the  message  passing  version. 
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messages  has  a  computational  overhead,  the  savings  in  network  traffic  translate  to  time  savings 
as  well.  For  all  the  trials  given  in  Table  2.  the  receiver  initiated  method  is  about  20%  faster 
than  the  corresponding  run  using  the  sender  initialed  method. 

The  second  point  to  note  about  the  network  traffic  on  the  message  passing  architecture  is 
that,  even  with  the  less  efficient  sender  initiated  method,  the  message  passing  network  traffic  is 
more  than  an  order  of  magnitude  less  than  that  for  shared  memory.  The  huge  difference  may. 
at  first  glance,  be  surprising,  but  it  can  be  explained  by  two  factors.  First,  the  updates  being 
performed  in  the  message  passing  version  occur,  at  most,  once  per  wire.  In  the  data  shown  in 
Table  2.  they  occur  approximately  every  two  wires.  Second,  these  updates  can  be  thought  of  as 
constituting  a  very  loose  form  of  coherence  protocol.  There  are  several  differences  between  this 
protocol  and  the  strict  one  implemented  on  the  shared  memory  system.  Because  updates  occur 
no  more  frequently  than  once  per  wire,  the  write  performed  at  the  wire  rip  up  stage  is  handled 
at  the  same  time  as  the  write  performed  at  the  wire  routing  stage.  Because  much  of  the  wire  s 
path  will  remain  the  same  after  rerouting,  these  two  writes  will  often  cancel  each  other,  and 
many  of  the  locations  will  not  need  to  be  updated  at  all.  4  This  removes  many  of  the  write 
operations,  a  significant  accomplishment  since  writes  are  the  cause  for  over  80%  of  the  bytes 
transferred  in  the  shared  memory  version. 

4.2.2  Solution  Quality 

A  discussion  of  network  traffic  in  the  message  passing  approach  is  incomplete  without  also 
discussing  the  solution  quality  and  execution  time  required.  The  first  point  we  note  is  that  the 
solution  quality,  that  is  the  height  of  the  circuit,  has  degraded  slightly  from  the  quality  given 
by  the  shared  memory  version,  but  is  still  acceptable.  Recall  that  the  solution  quality  for  the 
shared  memory  approach  was  136  for  bnrE  and  144  for  MDC. 

Some  quality  degradation  can  be  expected  in  the  message  passing  approach,  because  less 
information  is  available  to  each  processor  as  it  is  routing.  For  example,  the  cache  coherence 
hardware  on  the  shared  memory  machine  guarantees  a  perfectly  consistent  view  for  all  processors 
at  all  times.  The  only  inconsistency  in  the  shared  memory  approach  comes  from  not  locking  the 
cost  array,  and  as  previously  explained,  this  has  no  noticeable  effect  on  the  quality.  By  contrast, 
in  the  message  passing  approach,  updates  occur  only  after  the  processors  have  each  routed 

Mn  the  message  passing  implementation,  a  delta  array  is  maintained  which  records  changes  made  to  the  cost 
array.  If  no  changes  are  made  to  a  location,  or  the  changes  cancel  each  other,  updates  for  that  location  need  not 
be  sent. 
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one  or  more  wires.  This  leads  to  a  much  larger  level  of  inconsistency  in  the  processors*  cost 
arrays.  In  the  worst  case,  the  solution  quality  in  the  message  passing  approach  has  degraded 
by  10%  from  the  shared  memory  solution  quality.  LocusRoute  is  intended  to  be  used  with 
a  standard  cell  placement  program,  so  that  once  the  placement  program  has  decided  on  a 
placement,  LocusRoute  performs  the  routing  for  that  placement.  LocusRoute  returns  the  circuit 
height  as  a  measure  of  that  particular  placement  t  quality.  Thus,  in  this  situation,  slight  declines 
in  the  quality  of  the  routing  can  be  tolerated.  However,  if  LocusRoute  is  used  to  produce 
the  final  routing  for  a  circuit  that  is  to  be  mass  produced,  this  increased  area  could  become 
quite  significant.  For  these  cases,  this  characteristic  of  the  message  passing  implementation  of 
LocusRoute  could  make  it  undesirable. 

Section  5  will  discuss  how  the  quality  of  the  solution  can  be  improved  somewhat  by  exploiting 
locality  in  the  wires.  However,  another  obvious  way  to  improve  the  solution  quality  is  bv 
increasing  the  frequency  of  updates.  The  best  combination  of  execution  time  and  solution 
quality  obtained  for  bnrE  was  a  solution  quality  of  136  with  an  execution  time  of  1.7  seconds 
and  843,542  bytes  transferred.  Note  that  the  quality  of  this  run  equals  the  quality  given  by  the 
shared  memory  version.  The  bytes  transferred,  while  much  larger  than  any  of  those  in  Table  2  is 
still  about  a  factor  of  two  less  than  those  measured  for  the  shared  memory  version.  Results  such 
as  this  indicate  the  robustness  of  the  LocusRoute  algorithm  to  inconsistencies  in  the  cost  array. 
For  LocusRoute.  and  ether  applications  like  it.  hardware  cache  consistency  seems  to  impose  a 
large  cost  on  the  execution  of  the  program,  without  giving  compensating  benefits. 

4.2.3  Execution  Time 

Now.  we  turn  to  the  execution  time.  Execution  time  for  the  message  passing  version  of  Lo¬ 
cusRoute  depends  heavily  on  the  wire  assignment  strategy  for  two  reasons.  If  the  wires  are 
not  assigned  in  a  way  that  will  balance  the  load  on  the  processors,  execution  time  will  suffer 
greatly.  This  is  discussed  in  Section  5.3.2.  On  the  other  hand,  if  the  wire  assignment  does 
not  exploit  locality  well,  more  update  packets  will  need  to  be  sent  to  get  the  same  quality,  and 
the  extra  processing  time  spent  sending  and  receiving  messages  will  show  up  in  the  final  exe¬ 
cution  time  of  the  run.  In  general,  the  execution  times  given  in  Table  2  are  much  faster  than 
those  for  the  shared  memory  runs.  However,  recall  that  CBS  is  simulating  processors  which  are 
five  times  faster  than  the  processors  used  when  timing  the  shared  memory  version.  A  rough 
comparison  can  be  obtained  by  multiplying  the  execution  times  of  the  message  passing  version 
by  five.  The  fastest  execution  time  obtained  overall  is  0.97  seconds  for  the  receiver  initiated 
scheme.  Multiplying  by  five,  the  execution  time  becomes  4.85  seconds  which  is  still  13%  faster 
than  the  execution  time  from  the  shared  memory  case.  The  quality  obtained  in  the  message 
passing  run  being  considered  was  only  7%  worse  than  the  shared  memory  quality.  Note  that 
simple  multiplication  by  a  factor  of  five  when  comparing  the  execution  times  favors  the  shared 
memory  architecture.  This  is  because  if  the  processors  in  the  shared  memory  machine  really 
were  five  times  faster,  there  would  be  more  contention  on  the  bus.  and  the  overall  performance 
would  not  improve  by  a  factor  of  five. 


5  Effect  of  Locality 

The  previous  section  compared  the  network  traffic  required  by  LocusRoute  in  the  shared  memory 
and  message  passing  architectures.  Here  we  examine  the  effectiveness  of  exploiting  locality  to 
reduce  this  traffic  in  both  architectures.  Locality,  here,  is  a  measure  of  how  often  a  processor  is 
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routing  wires  within  its  owned  region  or  regions  close  by.  (A  quantitative  measure  is  described 
in  Section  5.3.1.)  Both  architectures  benefit  differently  from  exploiting  locality.  Message  passing 
architectures  benefit  from  locality  because  tbe  need  for  message  traffic  to  produce  a  certain  level 
of  solution  quality  is  reduced.  This  is  because  improving  the  locality  of  the  wires  routed  by  each 
processor  means  that  each  processor  will  have  a  better  view  of  the  part  of  the  cost  array  it  is 
routing  in.  and  fewer  updates  will  be  needed.  Shared  memory  architectures  benefit  from  locality 
through  better  cache  behavior.  Specifically,  shared  memory  architectures  benefit  because  of  two 
factors— better  spatial  locality,  and  less  interference  between  processors  causing  cache  coherence 
traffic. 

In  the  past,  locality  has  not  played  a  major  part  in  the  design  of  shared  memory  parallel 
programs.  In  a  traditional  global  bus  shared  memory  computer,  all  memory  is  equally  acces¬ 
sible  to  all  processors.  However,  as  hierarchical  approaches  are  used  to  scale  shared  memory 
multicomputers,  this  is  no  longer  true.  The  current  trend  towards  hierarchical  shared  memory 
machines,  in  which  a  local  reference  can  be  more  than  an  order  of  magnitude  faster  than  a  non¬ 
local  reference  implies  that  locality  must  become  an  important  part  of  future  program  design. 
In  this  section  we  present  data  that  indicates  that  implementations  taking  advantage  of  locality 
can  reduce  the  total  network  traffic  by  as  much  as  67% . 

5.1  Effect  of  Locality  in  the  Shared  Memory  Architecture 

In  this  subsection,  we  study  the  effects  of  locality  on  the  interconnection  network  traffic  generated 
by  a  shared-memory  architecture.  Table  3  shows  the  amount  of  network  traffic  generated  as  a 
function  of  differing  amounts  of  locality  exploited  in  the  application.  The  extreme  non-local 
case  is  taken  to  be  the  round  robin  wire  assignment  to  processors,  and  the  extreme  local  case 
(Threshold Cost  =  infinity)  is  taken  to  be  the  one  where  each  wire  is  assigned  to  the  processor 
whose  owned  region  contains  its  lower  left  pin.  We  also  consider  two  cases  with  intermediate 
locality. 

Table  3:  Effect  of  locality  in  shared  memory  version. 


Allocation 

Total 

Wires  Held  for 

Bytes 

Reduction  from 

Circuit 

strategy 

Wires 

round  robin  asmt. 

Transferred 

round  robin  (% ) 

bnrE 

round  robin 

420 

420 

1.306.152 

— 

ThresholdCost  =  30 

209 

1.299.580 

0.5 

ThresholdCost  =  1000 

25 

1.275.492 

2.3 

ThresholdCost  =  inf 

0 

1.219.576 

6.6 

MDC 

round  robin 

HI 

573 

1.596.940 

— 

ThresholdCost  =  30 

263 

1.608.520 

-0.7 

ThresholdCost  =  1000 

II 

38 

1.593.104 

.•  |  |  •••  | 

ThresholdCost  =  inf 

■Hi 

0 

1.516.468 

5.0  | 

For  bnrE  with  8  byte  cache  lines,  total  global  bus  traffic  can  be  reduced  6.6%  by  taking 
advantage  of  locality  in  the  assignment  of  wires.  It  is  clear  that  locality  is  not  producing  the 
significant  benefit  we  had  expected.  The  reason  for  this  is  that  cache  lines  contain  too  many 
cost  array  entries.  Since  each  cost  array  entry  is  one  byte,  a  cache  line  holds  eight  cost  array 
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entries.  This  means  that  for  each  byte  accessed,  seven  of  its  neighbors  are  brought  in  as  well. 
The  increase  in  traffic  brought  about  by  interference  between  the  caches  maintaining  coherence 
is  almost  as  big  as  the  decrease  in  traffic  due  to  locality.  To  check  this  theory,  we  increase  the 
size  of  the  cost  array  entries  to  4  byte  integers,  so  an  8  byte  cache  line  will  hold  only  two  of 
them.  In  this  case,  the  total  traffic  for  a  round  robin  wire  assignment  is  higher —  up  to  1.799.984 
bytes,  but  the  percent  gain  from  exploiting  locality  is  higher  as  well.  For  4  byte  array  entries 
with  an  8  byte  cache  line,  changing  to  the  infinite  ThresholdCost  wire  assignment  reduces  the 
traffic  by  15.6%.  The  reason  for  this  bigger  reduction  is  the  following.  With  a  cache  line  that 
holds  more  array  entries,  the  probability  of  interference  between  processors  causing  invalidations 
is  higher.  Intuitively  with  these  1  byte  cache  entries,  it  is  harder  for  a  processor  to  be  active 
in  only  a  small  section  of  the  co6t  array,  because  every  time  an  array  entry  is  accessed,  eight 
entries  are  fetched  into  the  cache.  If  each  processor  has  many  “extra’’  entries  in  its  cache,  the 
cache  coherence  traffic  will  reduce  the  benefit  possible  from  exploiting  locality.  Obviously,  we 
are  not  concluding  that  all  variables  be  made  as  large  as  possible,  so  that  they  interfere  less 
with  each  other.  Rather.. the  conclusion  is  that  care  must  be  used  when  allocating  memory  for 
write-shared  data  items.  The  notion,  stemming  from  uniprocessor  caches,  that  dense  data  will 
display  better  locality,  and  therefore  better  caching  behavior,  is  no  longer  true  when  speaking  of 
multiprocessor  cache  coherent  systems.  In  multiprocessor  cache  coherent  systems,  the  benefits 
of  data  density  must  be  traded  off  with  the  penalty  of  increased  coherency  traffic. 

5.2  Effect  of  Locality  in  the  Message  Passing  Approach 

Having  examined  the  traffic  in  the  shared  memory  case,  we  move  to  the  message  passing  case, 
where  the  effect  of  locality  is  much  more  significant.  Data  in  Table  4  shows  the  effect  of  various 
wire  assignment  strategies  on  the  quality  of  the  routed  circuit,  the  execution  time,  and  the 
number  of  bytes  transferred. 

One  can  see  that  in  general,  wire  assignments  which  do  not  take  advantage  of  locality,  such 
as  round  robin,  result  in  poorer  solution  quality  than  those  that  do.  such  as  assignments  made 
with  infinite  ThresholdCost.  The  average  quality  improvement  due  to  locality  over  the  cases 
shown  in  Table  4  is  about  4%.  While  small,  this  improvement  is  quite  significant,  because  the 
results  given  are  all  quite  close  to  optimal  anyway,  so  even  a  small  percentage  improvement  is 
difficult  to  achieve.  This  data  indicates  that  it  is  best  to  have  a  single  processor  route  the  wires 
in  one  area,  because  that  processor  will  have  more  accurate  information  about  the  cost  array  in 
that  area. 

Having  seen  that  exploiting  locality  improves  the  solution  quality,  the  next  question  is.  what 
is  the  effect  of  locality  on  the  number  of  bytes  transferred?  This  depends  heavily  on  the  type  of 
update  strategy  used.  In  the  sender  initiated  scheme,  updates  are  sent  out  for  any  owned  region 
in  the  sender ‘s  array  that  has  changed,  so  the  only  reduction  in  traffic  will  be  due  to  changes 
being  made  in  fewer  and  smaller  regions  of  the  cost  array.  The  change  in  bytes  transferred 
for  sender  initiated  updates  from  a  round  robin  assignment  to  a  local  assignment  with  infinite 
ThresholdCost  is  11.4%.  The  receiver  initiated  scheme  will  be  more  sensitive  to  locality,  because 
in  this  strategy,  low  locality  results  in  frequent  interprocessor  data  requests.  A  processor  only 
requests  an  update  if  it  has  routed  in  a  certain  owned  region  a  specific  number  of  times.  If. 
by  exploiting  locality,  we  reduce  the  frequency  with  which  update  requests  need  to  be  made, 
we  can  dramatically  reduce  the  message  traffic.  The  data  bears  out  this  prediction,  with  a 
traffic  reduction  of  more  than  a  factor  of  two  in  both  circuits,  when  going  from  a  round  robin 
assignment  policy  to  a  local  one.  Because  the  mixed  strategy  is  made  up  partly  of  receiver 
initiated  requests,  which  are  highly  sensitive  to  locality  and  partly  of  sender  initiated  requests. 
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Table  4:  Effect  of  locality  in  the  message  passing  version. 


Update 

Wire 

Circuit 

Exec. 

Bytes 

Reduc.  from 

Circuit 

method 

Assignment 

Height 

Time 

Transferred 

rnd-rbn  (%) 

bnrE 

Sender  Initiated 

round  robin 

145 

■Vgggjl 

— 

T  hr. Cost  =  30 

138 

1.467 

4.4 

Thr.C'ost  ss  1000 

139 

1.647 

9.8 

Thr.Cost  =  inf 

135 

-5.073 

138636 

11.4 

bnrE 

Receiver  Initiated 

round  robin 

150 

Ijg  Tjl 

87572 

Thr.Cost  =  30 

145 

Ex  73 

76334 

Thr.Cost  =  1000 

139 

1.217 

51582 

Thr.Cost  =  inf 

140 

4.043 

39858 

54.5 

bnrE 

Mixed 

round  robin 

wmgm 

1.519 

— 

Thr.Cost  =  30 

m 

218-568 

10.9 

Thr.Cost  =  1000 

142 

1.601 

179372 

26.9 

Thr.Cost  =  inf 

140 

4.864 

169992 

30.7 

MDC 

Sender  Initiated 

round  robin 

Kl 

2.171 

— 

Thr.Cost  =  30 

1.789 

■  . 

2.1 

Thr.Cost  =  1000 

5BB 

1.909 

225912 

4.4 

Thr.Cost  =  inf 

148 

5.323 

5.6 

MDC 

Receiver  Initiated 

round  robin 

85646 

— 

Thr.Cost  =  30 

18.0 

Thr.Cost  =  1000 

mm 

1.192 

1  1 

41.3 

Thr.Cost  =  inf 

152 

4.479 

28132 

67.2 

MDC 

Mixed 

round  robin 

2.208 

324914 

— 

Thr.Cost  =  30 

mm 

1.707 

291428 

10.3 

Thr.Cost  =  1000 

150 

1.850 

249578 

23.2 

Thr.Cost  =  inf 

150 

5.429 

236804 

27.1 

which  are  not  very  sensitive  to  locality,  the  reduction  in  network  traffic  is  between  the  two  other 
cases.  The  benefit  gained  by  exploiting  locality  in  the  mixed  sender/receiver  case  is  larger  than 
in  sender  initiated,  and  smaller  than  in  receiver  initiated. 

Of  course,  locality  also  has  an  effect  on  the  execution  time  of  the  application.  As  one  improves 
the  locality  of  the  application,  fewer  update  messages  are  needed,  and  the  time  the  processors 
spend  sending  and  receiving  messages  is  reduced.  This  has  a  direct  effect  on  the  execution  time. 
Unfortunately,  there  is  another  opposing  effect  as  well.  If  all  wires  are  assigned  to  processors 
strictly  on  the  basis  of  locality,  it  is  likely  that  the  resulting  wire  assignment  will  not  be  load 
balanced.  This  effect  will  be  discussed  in  Section  5.3.2. 

5.3  Limitations  on  Exploiting  Locality 

The  previous  subsections  have  shown  that  exploiting  locality  to  reduce  execution  time  and 
increase  quality  clearly  has  some  benefit.  Unfortunately,  there  are  several  factors  which  limit 
the  amount  to  be  gained  by  taking  advantage  of  locality  in  a  problem.  First,  the  standard  cell 
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circuits  themselves  have  only  a  limited  amount  of  locality.  If  the  vires  to  be  routed  are  long 
enough  to  pass  through  the  owned  regions  of  several  processors,  there  is  an  unavoidable  amount 
of  interprocessor  communication  that  will  take  place  to  perform  the  necessary  updates.  Further, 
as  the  number  of  processors  increases,  the  region  size  will  decrease,  and  the  locality  will  decrease 
as  well.  Second,  fully  exploiting  the  locality  that  does  exist  can  often  interfere  with  the  load 
balancing  of  the  processors.  If  there  are  many  wires  within  a  single  processor's  region,  then 
considering  locality  alone,  that  processor  should  route  them  all.  However,  this  may  cause  that 
processor  to  have  a  disproportionate  amount  of  work,  resulting  in  a  load  imbalance  and  poor 
performance.  These  two  issues  are  treated  in  the  sections  below. 

5.3.1  Limited  Circuit  Locality 

To  determine  an  upper  bound  on  the  degree  to  which  LocusRoute  can  take  advantage  of  locality, 
we  developed  a  measure  of  the  amount  of  locality  in  the  standard  cell  circuits  being  used.  Using 
this  measure,  we  found  that  the  degree  to  which  locality  can  be  exploited  is,  in  part,  limited  by 
the  circuits  themselves.  This  section  will  first  describe  the  method  of  measuring  circuit  locality, 
and  then  analyse  the  results  for  the  two  benchmark  circuits. 

The  measure  is  computed  in  the  following  steps.  First,  wires  are  assigned  to  processors 
using  one  of  the  methods  already  described.  Next,  for  each  processor,  the  program  computes 
the  number  of  wire  segments  to  be  routed  in  each  region  of  the  cost  array,  and  the  distance  in 
hops  from  the  region  where  the  segment  lies  to  the  processor  performing  the  routing.  Locality  is 
considered  to  be  the  average  distance  between  the  routing  processor  and  the  processor  that  owns 
a  region,  weighted  by  the  number  of  wire  segments  routed  at  this  distance.  Thus,  a  low  number 
means  good  locality.  For  example,  a  locality  measure  of  0  indicates  that  all  segments  were 
routed  by  the  region  owner,  giving  perfect  locality.  Increases  in  the  locality  measure  indicate 
that  the  average  segment  is  being  routed  at  a  distance  further  from  the  owner.  Note  that  when 
a  wire  assignment  with  infinite  ThresholdCost  is  used  for  this  calculation,  one  end  of  the  wire  is 
guaranteed  to  be  in  the  owned  region  of  t  he  processor  doing  the  routing.  In  this  case,  the  degree 
of  locality  is  mainly  a  measure  of  the  length  of  the  wire,  compared  to  the  size  of  the  individual 
cost  array  regions  because  it  measures  how  many  hops  the  other  end  of  the  wire  is  from  the 
lower  left  end.  which  is  certainly  in  the  region  of  the  routing  processor. 

Computing  the  locality  for  the  two  test  circuits,  using  several  wire  allocation  strategies, 
gave  the  results  shown  in  Table  5.  These  results  are  computed  assuming  16  processors.  The 
results  indicate  that  the  amount  of  locality  to  be  exploited  in  these  circuits  is  limited.  For  the 
bnrE  circuit,  using  a  round  robin  assignment,  the  average  segment  gets  routed  by  a  processor 
1.96  hops  away  from  the  processor  that  owns  the  region  the  segment  lies  in.  When  the  wire 
assignment  method  is  changed  to  one  with  ThresholdCost  equal  to  infinity,  the  average  segment 
is  routed  by  a  processor  only  1.24  hops  away.  As  the  number  of  processors  is  increased,  the 
locality  in  the  circuit  will  decrease  because  the  size  of  each  owned  region,  formed  by  splitting 
the  cost  array  into  equal  chunks,  will  decrease.  This  limited  locality  in  the  circuits  indicates  that 
the  message  passing  approach  may  require  substantially  more  message  traffic  with  very  large 
numbers  of  processors  than  it  currently  does,  and  that  the  solution  quality  will  be  degraded  as 
well. 

5.3.2  Tradeoff  between  Locality  and  Load  Balancing 

The  requirement  that  a  parallel  program  be  load  balanced  is  also  a  limitation  on  the  benefit 
possible  from  methods  which  exploit  locality.  To  some  extent,  a  circuit  with  good  locality  will 
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Table  5:  Circuit  locality. 


Circuit 

Allocation 

strategy 

Total 

Wires 

W'ires  Held  for 
RountLrobin  asrot. 

Measure  of 
Locality 

bnrE 

round  robin 

420 

420 

1.96 

ThresholdCost  =  30 

209 

1.77 

| 

ThresholdCost  =  1000 

25 

1.30 

1 

ThresholdCost  =  inf 

0 

1.24 

1  MDC 

round  robin 

573 

573 

2.05 

ThresholdCost  =  30 

263 

1.58 

ThresholdCost  =  1000 

38 

1.04 

ThresholdCost  =  inf 

0 

0.99 

require  fewer  updates,  and  therefore,  less  time  to  execute.  However,  the  effect  of  a  load  imbalance 
can  outweigh  the  subtle  effect  of  the  difference  in  update  time.  Therefore,  in  terms  of  execution 
time,  the  optimal  point  is  neither  a  fully  load  balanced  circuit,  w'here  the  update  time  becomes 
significant,  nor  a  fully  local  circuit,  where  the  load  imbalances  become  significant,  but  rather  a 
point  between  the  two. 

The  data  from  Table  4  shows  this  quite  clearly,  with  the  optimal  execution  time  in  almost 
every  case  being  the  ThresholdCost  =  30  wire  assignment.  The  most  obvious  example  of  the 
negative  effect  of  locality  on  load  balancing  is  exhibited  in  the  move  from  the  point  with  infinite 
ThresholdCost  to  the  point  with  ThresholdCost  equal  to  1000.  For  example  in  the  MDC  circuit, 
this  change  in  the  way  the  last  38  wires  are  assigned  gives  as  much  as  a  73%  execution  time 
reduction.  However,  wire  assignments  which  use  the  most  locality  generally  give  the  best  solution 
quality.  When  processors  route  in  localized  regions,  each  has  a  fairly  consistent  view  of  the  area 
it  is  routing  in.  Ultimately,  this  is  a  more  effective  way  to  produce  good  solution  quality  than 
nonlocalized  routing  with  periodic  updates. 


6  Conclusions 

The  goals  of  this  research  were  to  re-evaluate  the  tradeoffs  between  shared  memory  and  message 
passing  architectures  in  light  of  the  new  features  becoming  prevalent  in  the  two  architectures. 
Specifically,  we  studied  the  level  of  traffic  required  by  each  approach  to  maintain  consistency, 
and  the  effect  of  exploiting  locality  on  this  traffic.  Although  this  study  provides  data  from  a 
single  application,  we  feel  that  it  is  representative  of  a  class  of  applications  which  do  not  require 
the  strict  consistency  enforced  by  hardware  cache  coherence  schemes. 

We  show  that  implementing  the  LocusRoute  application  on  a  message  passing  machine  can 
result  in  a  dramatic  decrease  in  the  amount  of  interconnection  network  traffic,  with  only  a 
small  negative  effect  on  the  solution  quality.  This  is  especially  impressive  because  LocusRoute 
has  been  touted  as  an  excellent  application  for  a  shared  memory  architecture.  However,  this 
dramatic  improvement  did  have  a  cost.  The  explicit  control  afforded,  and  in  fact  required,  by 
the  message  passing  architecture  requires  significantly  larger  programming  effort.  The  decisions 
of  how  to  partition  the  cost  array  among  the  processors,  how  to  initiate  updates,  how  frequently 
updates  should  occur,  and  how  to  assign  wires  to  processors,  all  involve  complex  tradeoffs  and 


much  programming. 

We  further  show  that  exploiting  locality  in  the  message  passing  case  can  have  a  positive 
effect  on  all  three  of  the  factors  studied  in  this  paper:  solution  quality,  network  traffic,  and  to  a 
lesser  extent,  execution  time.  What,  then,  is  the  cost  of  exploiting  it?  The  answer,  once  again,  is 
that  exploiting  locality  currently  requires  more  programmer  effort  than  using  a  simpler  method 
which  does  not  exploit  it. 

In  this  paper,  we  also  studied  the  shared  memory  approach,  examining  the  traffic  necessary 
to  maintain  cache  consistency.  We  found  that  the  bus  traffic  in  a  shared  memory  architecture 
can  be  as  much  as  two  orders  of  magnitude  larger  than  the  network  traffic  in  a  shared  memory 
approach.  In  an  absolute  sense,  however,  the  amount  of  traffic  is  nos  excessive.  Fair  this  reason, 
in  applications  where  the  improved  solution  quality  given  by  the  shared  memory  implementation 
is  important,  it  appears  to  be  the  correct  choice.  Perhaps  the  most  compelling  benefit,  however, 
of  the  shared  memory  architecture  is  the  easy  and  natural  programming  environment  it  provides. 
The  hardware  cache  coherence  enforced  enables  programs  to  be  developed  and  debugged  more 
quickly  and  easily. 

We  also  presented  data  indicating  that  traditional  bus-based  shared  memory  machines  are 
not  extremely  sensitive  to  exploiting  locality.  In  the  LocusRoute  application,  this  is  in  part  due 
to  the  difficulty  of  singling  out  the  array  elements  needed  by  each  processor.  In  general,  even 
in  the  most  local  wire  assignment  method,  interference  between  processors  is  still  a  problem. 
Sharing  of  cost  array  information  leads  to  a  large  number  of  invalidations  and  refetches.  However, 
future  machines  relying  on  hierarchies  to  scale  the  total  number  of  processors,  are  expected  to 
be  more  sensitive  to  locality.  As  these  architectures  become  available,  more  research  will  be 
needed  to  automatically  detect  and  exploit  locality  in  parallel  programs. 
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